Big Data: GeoScience part 4

MSDE

Lviv University

K-Nearest Neighbor search

Details

  • K-Nearest Neighbor search techniques are also typically built on top of spatial indices to make the queries more efficient.
  • We need to use another tree structure called KD-Tree (K-dimensional tree) that can provide us information about K-nearest neighbors (i.e. not only the closest).
  • KD-tree is similar to R-tree, but the data is ordered and sorted in a bit different manner (see Appendices for further details).

K-Nearest Neighbor search

KNN search in Python

  • we will use scipy library
  • before we can do the actual query, we need to build the KD-Tree spatial index.
  • in scipy, we can use the KDTree to build the spatial index which is available from the scipy.spatial submodule.

K-Nearest Neighbor search

KNN search in Python

In the following, we use the building_points and stops GeoDataFrames that we already used earlier to find three closest public transport stops for each building.

Let’s start by reading the data and reproject the data into a metric coordinate reference system (EPSG:3067) so that the distances will be presented as meters:

import geopandas as gpd

# Read the files and reproject to EPSG:3067
stops = gpd.read_file("_data/Helsinki/pt_stops_helsinki.gpkg").to_crs(epsg=3067)
building_points = gpd.read_file("_data/Helsinki/building_points_helsinki.zip").to_crs(
    epsg=3067
)

building_points.head(2)
name geometry
0 None POINT (381166.6 6676424.438)
1 Uimastadion POINT (385236.565 6674238.472)

K-Nearest Neighbor search

KNN search in Python

stops.head(2)
stop_name stop_lat stop_lon stop_id geometry
0 Ritarihuone 60.16946 24.95667 1010102 POINT (386623.301 6672037.884)
1 Kirkkokatu 60.17127 24.95657 1010103 POINT (386623.991 6672239.572)
stops.shape
(8377, 5)

K-Nearest Neighbor search

KNN search in Python

  • As a first step, we need to build a KDTRee index structure based on the Point coordinates.
  • The KDTree class expects the Point coordinates to be in array format, i.e. not as shapely Point objects which we have stored in the geometry column.
  • Luckily, it is very easy to convert the shapely geometries into numpy.array format by chaining a method .get_coordinates() with the .to_numpy() method as follows:
building_coords = building_points.get_coordinates().to_numpy()
stop_coords = stops.geometry.get_coordinates().to_numpy()

stop_coords
array([[ 386623.30055697, 6672037.88387716],
       [ 386623.99053928, 6672239.57164472],
       [ 386629.00011373, 6672130.5382358 ],
       ...,
       [ 393706.51571504, 6707260.21305267],
       [ 391372.74617002, 6697807.78260742],
       [ 388733.41604041, 6714694.15542812]], shape=(8377, 2))

K-Nearest Neighbor search

KNN search in Python

We can now create a KD-Tree index structure as follows:

from scipy.spatial import KDTree

stop_kdt = KDTree(stop_coords)
stop_kdt
<scipy.spatial._kdtree.KDTree at 0x1220d8850>

Now we have initialized a KDTree index structure by populating it with stop coordinates.

K-Nearest Neighbor search

KNN search in Python

We can now use the .query() method which goes through all the input coordinates (i.e. buildings) and very quickly calculates which of them is the closest, 2nd closest etc.

The method returns the distances to the K-number of nearest neighbors as well as the index of the closest stop to the given building.

By passing an argument k=3, we can specify that we want to find three closest neighbors for each building:

# Find the three nearest neighbors from stop KD-Tree for each building
k_nearest_dist, k_nearest_ix = stop_kdt.query(building_coords, k=3)

len(k_nearest_dist)
158731

K-Nearest Neighbor search

KNN search in Python

# Distances to 3 nearest stops
k_nearest_dist
array([[ 92.67989301, 461.43820422, 466.16915044],
       [400.24336963, 409.49707253, 410.06137016],
       [109.81963349, 130.59749777, 133.6424814 ],
       ...,
       [135.34174505, 136.28586705, 274.93549394],
       [ 99.40810774, 118.1492825 , 209.42172325],
       [ 67.79042163,  71.91370036, 103.08138812]], shape=(158731, 3))
# Index values of the 3 nearest stops
k_nearest_ix
array([[1131, 1135, 1125],
       [ 467,  465,  475],
       [  61,   64,   13],
       ...,
       [4655, 4678, 4614],
       [4624, 4625, 4680],
       [4665, 4617, 4619]], shape=(158731, 3))

K-Nearest Neighbor search

KNN search in Python

  • Next, we will attach this information back to our GeoDataFrame so that it is easier to do further analyses with the data and create some nice maps out of the data.
  • The data which is returned by the stop_kdt.query() command comes out as an array of lists, where each item (list) contains three values that show the distances between three nearest stops and the given building.
  • To be able to easily merge this data with the original GeoDataFrame containing the building data, we need to transpose our arrays.
  • After the transpose, the data will be restructured in a way that there will be three arrays and each of these arrays contains the distances/stop-ids for all the buildings in a single list.

K-Nearest Neighbor search

KNN search in Python

# Make a copy
k_nearest = building_points.copy()

# Add indices of nearest stops
k_nearest["1st_nearest_idx"] = k_nearest_ix.T[0]
k_nearest["2nd_nearest_idx"] = k_nearest_ix.T[1]
k_nearest["3rd_nearest_idx"] = k_nearest_ix.T[2]

# Add distances
k_nearest["1st_nearest_dist"] = k_nearest_dist.T[0]
k_nearest["2nd_nearest_dist"] = k_nearest_dist.T[1]
k_nearest["3rd_nearest_dist"] = k_nearest_dist.T[2]
k_nearest.head()
name geometry 1st_nearest_idx 2nd_nearest_idx 3rd_nearest_idx 1st_nearest_dist 2nd_nearest_dist 3rd_nearest_dist
0 None POINT (381166.6 6676424.438) 1131 1135 1125 92.679893 461.438204 466.169150
1 Uimastadion POINT (385236.565 6674238.472) 467 465 475 400.243370 409.497073 410.061370
2 None POINT (386317.478 6672100.648) 61 64 13 109.819633 130.597498 133.642481
3 Hartwall Arena POINT (385225.109 6676120.56) 532 533 506 104.632434 137.706391 377.331985
4 Talli POINT (385079.733 6676989.745) 496 497 498 472.248282 519.685534 551.358778

K-Nearest Neighbor search

KNN search in Python

  • In the following, we create three separate GeoDataFrames that correspond to the nearest, second nearest and third nearest stops from the buildings.
  • We start by storing the stop_index as a column which allows us to easily merge the data between stops and k_nearest (buildings) GeoDataFrames.
  • For making the table join, we can use the pandas .merge() function in which we use the 1st_nearest_idx, 2nd_nearest_idx and 3rd_nearest_idx as the key on the left GeoDataFrame, while the stop_index is the key on the right GeoDataFrame.
  • We also pass the suffixes=('', '_knearest) argument to the .merge() method
# Store the stop index for making the table join
stops["stop_index"] = stops.index
# Merge the geometries of the nearest stops to the GeoDataFrame
k_nearest_1 = k_nearest.merge(
    stops[["stop_index", "geometry"]],
    left_on="1st_nearest_idx",
    right_on="stop_index",
    suffixes=("", "_knearest"),
)
k_nearest_1.head(2)
name geometry 1st_nearest_idx 2nd_nearest_idx 3rd_nearest_idx 1st_nearest_dist 2nd_nearest_dist 3rd_nearest_dist stop_index geometry_knearest
0 None POINT (381166.6 6676424.438) 1131 1135 1125 92.679893 461.438204 466.16915 1131 POINT (381256.66 6676446.317)
1 Uimastadion POINT (385236.565 6674238.472) 467 465 475 400.243370 409.497073 410.06137 467 POINT (384973.331 6674539.973)

K-Nearest Neighbor search

KNN search in Python

# Merge the geometries of the 2nd nearest stops to the GeoDataFrame
k_nearest_2 = k_nearest.merge(
    stops[["stop_index", "geometry"]],
    left_on="2nd_nearest_idx",
    right_on="stop_index",
    suffixes=("", "_knearest"),
)
k_nearest_2.head(2)
name geometry 1st_nearest_idx 2nd_nearest_idx 3rd_nearest_idx 1st_nearest_dist 2nd_nearest_dist 3rd_nearest_dist stop_index geometry_knearest
0 None POINT (381166.6 6676424.438) 1131 1135 1125 92.679893 461.438204 466.16915 1135 POINT (381625.316 6676474.488)
1 Uimastadion POINT (385236.565 6674238.472) 467 465 475 400.243370 409.497073 410.06137 465 POINT (385270.775 6674646.538)

K-Nearest Neighbor search

KNN search in Python

# Merge the geometries of the 3rd nearest stops to the GeoDataFrame
k_nearest_3 = k_nearest.merge(
    stops[["stop_index", "geometry"]],
    left_on="3rd_nearest_idx",
    right_on="stop_index",
    suffixes=("", "_knearest"),
)
k_nearest_3.head(2)
name geometry 1st_nearest_idx 2nd_nearest_idx 3rd_nearest_idx 1st_nearest_dist 2nd_nearest_dist 3rd_nearest_dist stop_index geometry_knearest
0 None POINT (381166.6 6676424.438) 1131 1135 1125 92.679893 461.438204 466.16915 1125 POINT (381632.74 6676429.668)
1 Uimastadion POINT (385236.565 6674238.472) 467 465 475 400.243370 409.497073 410.06137 475 POINT (385122.17 6674632.254)

K-Nearest Neighbor search

KNN search in Python

Now we can create LineString geometries connecting these Point objects to each other:

from shapely import LineString

# Generate LineStrings connecting the building point and K-nearest neighbor
k_nearest_1["geometry"] = k_nearest_1.apply(
    lambda row: LineString([row["geometry"], row["geometry_knearest"]]), axis=1
)
k_nearest_2["geometry"] = k_nearest_2.apply(
    lambda row: LineString([row["geometry"], row["geometry_knearest"]]), axis=1
)
k_nearest_3["geometry"] = k_nearest_3.apply(
    lambda row: LineString([row["geometry"], row["geometry_knearest"]]), axis=1
)

k_nearest_1.head(2)
name geometry 1st_nearest_idx 2nd_nearest_idx 3rd_nearest_idx 1st_nearest_dist 2nd_nearest_dist 3rd_nearest_dist stop_index geometry_knearest
0 None LINESTRING (381166.6 6676424.438, 381256.66 66... 1131 1135 1125 92.679893 461.438204 466.16915 1131 POINT (381256.66 6676446.317)
1 Uimastadion LINESTRING (385236.565 6674238.472, 384973.331... 467 465 475 400.243370 409.497073 410.06137 467 POINT (384973.331 6674539.973)

K-Nearest Neighbor search

KNN search in Python

In the following, we select a specific building and the closest stops from that building. The name column contains information about the names of the buildings which we can use to choose a building of our interest for visualization:

# Find unique building names
k_nearest.name.unique()
array([None, 'Uimastadion', 'Hartwall Arena', ..., 'Peltolan koulu',
       'Hilton Helsinki Airport', 'Posti Oy Logistiikkakeskus'],
      shape=(3010,), dtype=object)

K-Nearest Neighbor search

KNN search in Python

Thus, let’s filter the data for Hartwall Arena and create an interactive map out of the results, showing the three closest stops indicated with different colors:

# Visualize 3 nearest stops to
selected_name = "Hartwall Arena"

m = k_nearest_1.loc[k_nearest_1["name"] == selected_name].explore(
    color="red", tiles="CartoDB Positron", max_zoom=16
)
m = k_nearest_2.loc[k_nearest_2["name"] == selected_name].explore(m=m, color="orange")
m = k_nearest_3.loc[k_nearest_3["name"] == selected_name].explore(m=m, color="blue")
m = stops.explore(m=m, color="green")
m
Make this Notebook Trusted to load map: File -> Trust Notebook

K-Nearest Neighbor search

Nearest neighbors within radius

  • We aim to find and calculate the number of buildings that are within 200 meters from a given public transport stop.
  • Finding all neighbors within a specific search radius can also be done using the KD-Tree spatial index.
  • However, in this case we actually build the KDTree index for both datasets (buildings and stops) and then use a .query_ball_tree() method to find all neighbors within the radius r:
from scipy.spatial import KDTree

# Build KDTree indices
stop_kdt = KDTree(stop_coords)
building_kdt = KDTree(building_coords)

# Find the three nearest neighbors from stop KD-Tree for each building
k_nearest_ix = stop_kdt.query_ball_tree(building_kdt, r=200)
8377

K-Nearest Neighbor search

Nearest neighbors within radius

Now we have found all the building points within 200 meters from the stops (n=8377).

The following shows all the indices for the first stop at index 0:

k_nearest_ix[0]
[1129,
 1130,
 1155,
 2054,
 2055,
 2056,
... (output truncated)

K-Nearest Neighbor search

Nearest neighbors within radius

Now we can easily store the building indices as a new column to the stops GeoDataFrame:

stops["building_ids_within_range"] = k_nearest_ix
stops.head()
stop_name stop_lat stop_lon stop_id geometry stop_index building_ids_within_range
0 Ritarihuone 60.169460 24.956670 1010102 POINT (386623.301 6672037.884) 0 [1129, 1130, 1155, 2054, 2055, 2056, 2057, 205...
1 Kirkkokatu 60.171270 24.956570 1010103 POINT (386623.991 6672239.572) 1 [1130, 2054, 2055, 2057, 2058, 2059, 2066, 206...
2 Kirkkokatu 60.170293 24.956721 1010104 POINT (386629 6672130.538) 2 [1129, 1130, 2054, 2055, 2056, 2057, 2058, 205...
3 Vironkatu 60.172580 24.956554 1010105 POINT (386627.617 6672385.448) 3 [2060, 2062, 2063, 2064, 2065, 2066, 2067, 206...
4 Vironkatu 60.172990 24.956380 1010106 POINT (386619.379 6672431.394) 4 [1136, 2060, 2061, 2062, 2063, 2064, 2065, 206...

K-Nearest Neighbor search

Nearest neighbors within radius

With this information, we can for example calculate the number of buildings within 200 meters from each stop. To do this, we can create a simple lambda function that checks the length of the id-list and store the result into column building_cnt:

stops["building_cnt"] = stops["building_ids_within_range"].apply(
    lambda id_list: len(id_list)
)
stops.head()
stop_name stop_lat stop_lon stop_id geometry stop_index building_ids_within_range building_cnt
0 Ritarihuone 60.169460 24.956670 1010102 POINT (386623.301 6672037.884) 0 [1129, 1130, 1155, 2054, 2055, 2056, 2057, 205... 50
1 Kirkkokatu 60.171270 24.956570 1010103 POINT (386623.991 6672239.572) 1 [1130, 2054, 2055, 2057, 2058, 2059, 2066, 206... 69
2 Kirkkokatu 60.170293 24.956721 1010104 POINT (386629 6672130.538) 2 [1129, 1130, 2054, 2055, 2056, 2057, 2058, 205... 56
3 Vironkatu 60.172580 24.956554 1010105 POINT (386627.617 6672385.448) 3 [2060, 2062, 2063, 2064, 2065, 2066, 2067, 206... 74
4 Vironkatu 60.172990 24.956380 1010106 POINT (386619.379 6672431.394) 4 [1136, 2060, 2061, 2062, 2063, 2064, 2065, 206... 78

K-Nearest Neighbor search

Nearest neighbors within radius

print("Max number of buildings:", stops["building_cnt"].max())
print("Average number of buildings:", stops["building_cnt"].mean().round(1))
Max number of buildings: 181
Average number of buildings: 32.2

K-Nearest Neighbor search

Nearest neighbors within radius

By calculating simple statistics from the building_cnt column, we can see that on average there are 32.2 buildings within 200 meters from the public transport stops and the maximum number of buildings within this distance is whopping 181 buildings.

  • This indicates very dense neighborhood having numerous buildings in a small area.
  • To better understand, where this kind of neighborhood is located and what does it look like, we can make a map by selecting the rows with highest number of buildings and then plotting the stop and building points within radius:
filtered = stops["building_cnt"] == stops["building_cnt"].max()
building_ids = stops.loc[filtered].building_ids_within_range.values[0]

m = stops.loc[filtered].explore(
    tiles="CartoDB Positron", color="red", marker_kwds={"radius": 5}, max_zoom=16
)
building_points.loc[building_ids].explore(m=m)
Make this Notebook Trusted to load map: File -> Trust Notebook

K-Nearest Neighbor search

Exercise

There is also an alternative approach for making a radius query by calculating a buffer around the stop points and then making a spatial join between these Polygon geometries and the buildings. This approach also allows to make queries between other type of geometries than Points.

  • Test and try to find all buildings within 200 meters from the transit stops by creating a 200 meter buffer around the stops and then making a spatial join between the buffers and building points.
  • Calculate the number of buildings per stop_id.
  • Did it take longer to find the nearest buildings using this approach?